Abstract

The spatial interaction between two or more classes of points may cause spatial clustering patterns, such as segregation or association, which can be tested using a nearest neighbor contingency table (NNCT). A NNCT is constructed using the frequencies of class types of points in nearest neighbor (NN) pairs. For tests based on NNCTs (i.e., NNCT-tests), the null pattern is either complete spatial randomness (CSR) of the points from two or more classes (called CSR independence) or random labeling (RL). The RL pattern implies that the locations of the points in the study region are fixed, while the CSR independence pattern implies that they are random. The distributions of the NNCT-test statistics depend on the number of reflexive NNs (denoted by R) and the number of shared NNs (denoted by Q), both of which depend on the allocation of the points. Hence Q and R are fixed quantities under RL, but random variables under CSR independence. However, given the difficulty in calculating the expected values of Q and R under CSR independence, one can use their observed values in NNCT analysis, which makes the distributions of the NNCT-test statistics conditional on Q and R under CSR independence. In this article, I use the empirically estimated expected values of Q and R under the CSR independence pattern to remove the conditioning of NNCT-tests (such a correction is called the QR-adjustment, henceforth). I present a Monte Carlo simulation study to compare the conditional NNCT-tests (i.e., tests in which the observed values of Q and R are used) and unconditional NNCT-tests (i.e., empirically QR-adjusted tests) under CSR independence and segregation and association alternatives. I demonstrate that QR-adjustment does not significantly improve the empirical size estimates under CSR independence and power estimates under segregation or association alternatives. For illustrative purposes, I apply the conditional and empirically corrected tests on two example data sets.

Keywords: Association; complete spatial randomness; conditional test; nearest neighbor contingency table; random 
labeling; spatial clustering; spatial pattern; segregation 
1 Introduction



Spatial patterns have been studied extensively and have important implications in many fields such as
epidemiology, population biology, and ecology. It is of practical interest to study the univariate spatial
patterns (i.e., patterns of only one class) as well as multivariate patterns (i.e., patterns of multiple classes)
(Pielou (1961), Whipple (1980), and Dixon (1994, 2002)). For convenience and generality, I call the different
types of points "classes", but a class label can stand for any characteristic of a measurement at a particular
location. For example, the spatial segregation pattern has been investigated for species (Diggle (2003)), age
classes of plants (Hamill and Wright (1986)), fish species (Herler and Patzner (2005)), and sexes of dioecious
plants (Nanami et al. (1999)). Many of the epidemiological applications are for a two-class system of case
and control labels (Waller and Gotway (2004)).



In this article, for simplicity, I describe the spatial point patterns for two classes only; the extension
to the multi-class case is straightforward. The null pattern is usually one of the two (random) pattern types:
complete spatial randomness (CSR) of two or more classes or random labeling (RL) of a set of fixed points
with two classes. That is, when the points from each class are assumed to be uniformly distributed over the
region of interest, the null hypothesis is the CSR of points from two classes. This type of CSR pattern is
also referred to as "population independence" in the literature (Goreaud and Pélissier (2003)). In the univariate
spatial analysis, CSR refers to the pattern in which locations of points from a single class are random over
the study area. To distinguish the CSR of points from two classes and CSR of points from one class, I call
the former "CSR independence" and the latter "CSR", henceforth. Note that CSR independence is
equivalent to the case that the RL procedure is applied to a given set of points from a CSR pattern, in the sense
that after points are generated uniformly in the region, the class labels are assigned randomly. When only
the labeling of a set of fixed points (the allocation of the points could be regular, aggregated, or clustered,
or of lattice type) is random, the null hypothesis is the RL pattern.



Many tests of spatial segregation have been developed in the literature (Orton (1982)). These include
comparison of Ripley's K(t) or L(t) functions (Ripley (2004)), comparison of nearest neighbor (NN) distances
(Diggle (2003), Cuzick and Edwards (1990)), and analysis of nearest neighbor contingency tables (Pielou
(1961), Meagher and Burdick (1980)). Nearest neighbor contingency tables (NNCTs) are constructed using
the frequencies of classes of points in NN pairs. Kulldorff (2006) provides an extensive review of tests of
spatial randomness that adjust for an inhomogeneity of the densities of the underlying populations. Pielou
(1961) proposed various tests and Dixon (1994) introduced an overall test of segregation, cell-, and class-specific
tests based on NNCTs for the two-class case and extended his tests to the multi-class case (Dixon
(2002)). These tests based on NNCTs (i.e., NNCT-tests) were designed for testing the RL of points. Pielou
(1961) used the usual Pearson's $\chi^2$-test of independence for detecting the segregation of the two classes.



Due to the ease in computation and interpretation, Pielou's test of segregation is frequently used for both
CSR independence and RL patterns. However, it has been shown that Pielou's test is not appropriate for
testing RL (Meagher and Burdick (1980), Dixon (1994)). Dixon (1994) derived the appropriate (asymptotic)
sampling distribution of cell counts using Moran join count statistics (Moran (1948)) and hence the
appropriate test which also has a $\chi^2$-distribution (Dixon (1994)). For the two-class case, Ceyhan (2006)
compared these tests, extended the tests for testing CSR independence, and demonstrated that Pielou's
tests are only appropriate for a random sample of (base, NN) pairs. Furthermore, Ceyhan (2007) proposed
three new overall segregation tests. Since Pielou's test is not appropriate, NNCT-tests only refer to Dixon's
overall test and the three new segregation tests proposed by Ceyhan (2007). However, the distributions of the
NNCT-test statistics depend on the number of reflexive NNs (denoted by R) and the number of shared NNs
(denoted by Q), both of which depend on the allocation of the points. Hence Q and R are fixed under RL,
but random under CSR independence. The expectations of Q and R do not seem to be available analytically
under CSR independence, so their observed values were used by Ceyhan (2007). In this article, I replace
the expectations of Q and R by their empirical estimates under CSR independence. Such a correction for
removing the conditional nature of NNCT-tests is called "QR-adjustment", henceforth.



The NNCT-tests are designed for testing a more general null hypothesis, namely, $H_o$: randomness
in the NN structure, which usually results from CSR independence or RL. The distinction between CSR
independence and RL is very important when defining the appropriate null model for each empirical case,
i.e., the null model depends on the particular context. Goreaud and Pélissier (2003) discuss the differences
between these two null hypotheses and demonstrate that the misinterpretation is very common. They
assert that under CSR independence the (locations of the points from) two classes are a priori the result of
different processes (e.g., individuals of different species or age cohorts), whereas under RL some processes
affect a posteriori the individuals of a single population (e.g., diseased versus non-diseased individuals of a
single species). Notice that although CSR independence and RL are not the same, they lead to the same null
model (i.e., randomness in NN structure) for NNCT-tests, since a NNCT does not require spatially-explicit
information.



I consider two major types of (bivariate) spatial clustering patterns, namely, association and segregation
as alternative patterns. Association occurs if the NN of an individual is more likely to be from another class.
Segregation occurs if the NN of an individual is more likely to be of the same class as the individual; i.e.,
the members of the same class tend to be clumped or clustered (see, e.g., Pielou (1961)). For more detail
on these alternative patterns, see Ceyhan (2007). I assess the effects of QR-adjustment on the size of the
NNCT-tests under CSR independence and on the power of the tests under the segregation or association
alternatives by an extensive Monte Carlo study.



Throughout the article I adopt the convention that random quantities are denoted by capital letters,
while fixed quantities are denoted by lower case letters. I describe the construction of NNCTs in Section
2.1, provide Dixon's tests in Sections 2.2 and 2.4, empirical significance levels of the tests in Section 3, two
illustrative examples in Section 5, and discussion and conclusions in Section 6.



2 Nearest Neighbor Contingency Tables and Related Tests 



2.1 Construction of the Nearest Neighbor Contingency Tables 



NNCTs are constructed using the NN frequencies of classes. I describe the construction of NNCTs for two
classes; extension to the multi-class case is straightforward. Consider two classes with labels $\{1,2\}$. Let $N_i$
be the number of points from class $i$ for $i \in \{1,2\}$ and $n$ be the total sample size, so $n = N_1 + N_2$. If I
record the class of each point and the class of its NN, the NN relationships fall into four distinct categories:
(1,1), (1,2); (2,1), (2,2) where in cell $(i,j)$, class $i$ is the base class, while class $j$ is the class of its NN.
That is, the $n$ points constitute $n$ (base, NN) pairs. Then each pair can be categorized with respect to the
base label (row categories) and NN label (column categories). Denoting $N_{ij}$ as the frequency of cell $(i,j)$ for
$i,j \in \{1,2\}$, I obtain the NNCT in Table 1 where $C_j$ is the sum of column $j$; i.e., number of times class $j$
points serve as NNs for $j \in \{1,2\}$. Furthermore, $N_{ij}$ is the cell count for cell $(i,j)$ that is the sum of all (base,
NN) pairs each of which has label $(i,j)$. Note also that $n = \sum_{i,j} N_{ij}$; $n_i = \sum_{j=1}^{2} N_{ij}$; and $C_j = \sum_{i=1}^{2} N_{ij}$.
By construction, if $N_{ij}$ is larger (smaller) than expected, then class $j$ serves as NN more (less) to class $i$
than expected, which implies (lack of) segregation if $i = j$ and (lack of) association of class $j$ with class $i$ if
$i \neq j$. Hence, column sums and cell counts are random, while row sums and the overall sum are fixed quantities
in a NNCT.







                              NN class
                         class 1    class 2    sum
base class    class 1    N_11       N_12       n_1
              class 2    N_21       N_22       n_2
              sum        C_1        C_2        n

Table 1: The NNCT for two classes.
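
The construction of Table 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `nearest_neighbor` and `nnct` are hypothetical, the NN search is a brute-force loop, ties in NN distance are ignored (they occur with probability zero for continuous data), and the five example points are made up.

```python
import math

def nearest_neighbor(points):
    """Return nn, where nn[i] is the index of the nearest neighbor of point i."""
    nn = []
    for i, (xi, yi) in enumerate(points):
        best, best_d = -1, math.inf
        for j, (xj, yj) in enumerate(points):
            if j != i:
                d = (xi - xj) ** 2 + (yi - yj) ** 2  # squared Euclidean distance
                if d < best_d:
                    best, best_d = j, d
        nn.append(best)
    return nn

def nnct(points, labels):
    """2x2 NNCT: rows index the base class, columns the NN class (labels are 1 or 2)."""
    table = [[0, 0], [0, 0]]
    for base, neighbor in enumerate(nearest_neighbor(points)):
        table[labels[base] - 1][labels[neighbor] - 1] += 1
    return table

# Hypothetical toy data: three class-1 points and two class-2 points.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (0.5, 0.45)]
labels = [1, 1, 2, 2, 1]
table = nnct(points, labels)
row_sums = [sum(row) for row in table]            # n_1, n_2 (fixed)
col_sums = [sum(col) for col in zip(*table)]      # C_1, C_2 (random)
```

In this toy configuration every point has a NN of its own class, so all counts fall on the diagonal, which is the extreme form of segregation.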



Observe that, under segregation, the diagonal entries, $N_{ii}$ for $i = 1,2$, tend to be larger than expected;
under association, the off-diagonals tend to be larger than expected. The general alternative is that some
cell counts are different than expected under CSR independence or RL.

In the two-class case, Pielou (1961) used Pearson's $\chi^2$-test of independence to detect any deviation from
CSR independence or RL. But, under CSR independence or RL, this test is liberal, i.e., has larger size than
the nominal level (Ceyhan (2006)), hence it is not considered in this article. Dixon (1994) proposed a series of
tests for segregation based on NNCTs. He first devised four cell-specific tests in the two-class case, and then
combined them to form an overall test. For his tests, the probability of an individual from class $j$ serving as
a NN of an individual from class $i$ depends only on the class sizes (i.e., row sums), but not the total number
of times class $j$ serves as NNs (i.e., column sums).



2.2 Dixon's Cell-Specific Tests 

The level of segregation is estimated by comparing the observed cell counts to the expected cell counts under
RL of points that are fixed. Dixon demonstrates that under RL, one can write down the cell frequencies as
Moran join count statistics (Moran (1948)). He then derives the means, variances, and covariances of the
cell counts (frequencies) in a NNCT (Dixon (1994, 2002)).



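The null hypothesis under RL is given by

\[
H_o:\ \mathbf{E}[N_{ij}] =
\begin{cases}
n_i\,(n_i - 1)/(n - 1) & \text{if } i = j, \\
n_i\,n_j/(n - 1) & \text{if } i \neq j.
\end{cases}
\tag{1}
\]
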

Observe that the expected cell counts depend only on the size of each class (i.e., row sums), but not on 
column sums. 

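The cell-specific test statistics suggested by Dixon are given by

\[
Z^D_{ij} = \frac{N_{ij} - \mathbf{E}[N_{ij}]}{\sqrt{\mathbf{Var}[N_{ij}]}},
\tag{2}
\]

where

\[
\mathbf{Var}[N_{ij}] =
\begin{cases}
(n + R)\,p_{ii} + (2n - 2R + Q)\,p_{iii} + (n^2 - 3n - Q + R)\,p_{iiii} - (n\,p_{ii})^2 & \text{if } i = j, \\
n\,p_{ij} + Q\,p_{iij} + (n^2 - 3n - Q + R)\,p_{iijj} - (n\,p_{ij})^2 & \text{if } i \neq j,
\end{cases}
\tag{3}
\]

with $p_{xx}$, $p_{xxx}$, and $p_{xxxx}$ being the probabilities that a randomly picked pair, triplet, or quartet of points,
respectively, are the indicated classes, which are given by

\begin{align*}
p_{ii} &= \frac{n_i\,(n_i - 1)}{n\,(n - 1)}, &
p_{ij} &= \frac{n_i\,n_j}{n\,(n - 1)}, \\
p_{iii} &= \frac{n_i\,(n_i - 1)\,(n_i - 2)}{n\,(n - 1)\,(n - 2)}, &
p_{iij} &= \frac{n_i\,(n_i - 1)\,n_j}{n\,(n - 1)\,(n - 2)}, \\
p_{iijj} &= \frac{n_i\,(n_i - 1)\,n_j\,(n_j - 1)}{n\,(n - 1)\,(n - 2)\,(n - 3)}, &
p_{iiii} &= \frac{n_i\,(n_i - 1)\,(n_i - 2)\,(n_i - 3)}{n\,(n - 1)\,(n - 2)\,(n - 3)}.
\end{align*}
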

Furthermore, $Q$ is the number of points with shared NNs, which occur when two or more points share a NN,
and $R$ is twice the number of reflexive pairs. Then $Q = 2\,(Q_2 + 3\,Q_3 + 6\,Q_4 + 10\,Q_5 + 15\,Q_6)$ where $Q_k$ is
the number of points that serve as a NN to other points $k$ times. One-sided and two-sided tests are possible
for each cell $(i,j)$ using the asymptotic normal approximation of $Z^D_{ij}$ given in Equation (2) (Dixon (1994)).

The test in Equation (2) is the same as Dixon's $Z_{AA}$ when $i = j = 1$; same as $Z_{BB}$ when $i = j = 2$ (Dixon
(1994)). Note also that in Equation (2) four different tests are defined as there are four cells and each is
testing the deviation from the null case in the respective cell. These four tests are combined and used in
defining an overall test of segregation in Section 2.4.
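
As a small illustration of how Q and R follow from the NN structure alone, the sketch below computes both from a list of NN indices. The function name `q_and_r` and the five-point NN list are hypothetical; the code uses the identity $2\binom{k}{2} = k(k-1)$, so that the weighted sum over $Q_k$ becomes a sum of $k(k-1)$ over points.

```python
from collections import Counter

def q_and_r(nn):
    """Compute (Q, R) from nn, where nn[i] is the index of point i's nearest neighbor."""
    # R: number of points in reflexive NN pairs (twice the number of reflexive pairs).
    r = sum(1 for i, j in enumerate(nn) if nn[j] == i)
    # Q = 2 * sum_k C(k, 2) * Q_k = sum over points of k * (k - 1),
    # where k is the number of times a point serves as a NN.
    served = Counter(nn)
    q = sum(k * (k - 1) for k in served.values())
    return q, r

# Hypothetical NN structure: points 0 and 1 are mutual NNs, points 2 and 3 are
# mutual NNs, and point 4 has point 1 as its NN (so point 1 is shared by 0 and 4).
q, r = q_and_r([1, 0, 3, 2, 1])
```

Here two points (0 and 4) share a NN, giving Q = 2, and there are two reflexive pairs, giving R = 4.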



Under CSR independence, the null hypothesis, the test statistics, and the variances are as in the RL case 
for the cell-specific tests, except for the fact that the variances are conditional on Q and R. 
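
A minimal sketch of the cell-specific statistics follows, assuming the expected counts of Equation (1) and the variances of Dixon's cell-specific tests as described above. The function names and the example NNCT are hypothetical; passing the observed Q and R gives the conditional version, while passing other values (for example, estimates of their expectations) gives an adjusted version.

```python
import math

def expected_count(i, j, sizes):
    """E[N_ij] under RL for class sizes `sizes` (0-based class indices)."""
    n = sum(sizes)
    ni, nj = sizes[i], sizes[j]
    return ni * (ni - 1) / (n - 1) if i == j else ni * nj / (n - 1)

def variance(i, j, sizes, q, r):
    """Var[N_ij] under RL, as a function of Q and R."""
    n = sum(sizes)
    ni, nj = sizes[i], sizes[j]
    if i == j:
        p2 = ni * (ni - 1) / (n * (n - 1))
        p3 = p2 * (ni - 2) / (n - 2)
        p4 = p3 * (ni - 3) / (n - 3)
        return ((n + r) * p2 + (2 * n - 2 * r + q) * p3
                + (n * n - 3 * n - q + r) * p4 - (n * p2) ** 2)
    p_ij = ni * nj / (n * (n - 1))
    p_iij = ni * (ni - 1) * nj / (n * (n - 1) * (n - 2))
    p_iijj = ni * (ni - 1) * nj * (nj - 1) / (n * (n - 1) * (n - 2) * (n - 3))
    return n * p_ij + q * p_iij + (n * n - 3 * n - q + r) * p_iijj - (n * p_ij) ** 2

def cell_z(table, q, r):
    """2x2 list of cell-specific Z statistics for a 2x2 NNCT."""
    sizes = [sum(row) for row in table]  # row sums are the class sizes
    return [[(table[i][j] - expected_count(i, j, sizes))
             / math.sqrt(variance(i, j, sizes, q, r))
             for j in range(2)] for i in range(2)]

# Hypothetical NNCT with n_1 = n_2 = 50 and made-up values of Q and R.
table = [[30, 20], [18, 32]]
z = cell_z(table, q=63, r=62)
```

Because the row sums are fixed, $N_{12} = n_1 - N_{11}$ in the two-class case, so the two statistics in a row have equal magnitude and opposite sign; this is a convenient sanity check on any implementation.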



2.3 The Status of Q and R under CSR Independence and RL 

Note the difference in status of the variables Q and R under CSR independence and RL models. Under
RL, Q and R are fixed quantities; while under CSR independence, they are random. The variances given
in Equation (3), and all the quantities depending on them, also depend on Q and R.
Hence these expressions are appropriate under the RL pattern. Under the CSR independence pattern they are
conditional variances and covariances obtained by using the observed values of Q and R. The unconditional
variances and covariances can be obtained by replacing Q and R with their expectations.

Unfortunately, given the difficulty of calculating the expectations of Q and R under CSR independence,
Ceyhan (2007) employed the conditional variances and covariances (i.e., the variances and covariances
for which observed Q and R values are used) even when assessing their behavior under the CSR independence
pattern. Alternatively, I can estimate the expected values of Q and R empirically as follows. I generate
$n \in \{10, 20, 30, 40, 50, 100, 500, 1000\}$ points that are iid (independently and identically distributed) from
$\mathcal{U}((0,1) \times (0,1))$, the uniform distribution on the unit square. I repeat this procedure $N_{mc} = 1000000$ times.
At each Monte Carlo replication, I calculate Q and R values, and record the ratios $Q/n$ and $R/n$. I plot these
ratios in Figure 1 as a function of sample size $n$. Observe that the ratios seem to converge as $n$ increases. For
the homogeneous planar Poisson pattern, I have $\mathbf{E}[Q/n] \approx 0.6327860$ and $\mathbf{E}[R/n] \approx 0.6211200$. Hence, I replace
Q and R by $0.63\,n$ and $0.62\,n$, respectively, to obtain the QR-adjusted variances and covariances.
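
The Monte Carlo estimation described above can be sketched as follows. This is a scaled-down illustration, not the simulation code used for Figure 1: the function names are hypothetical, the NN search is brute force, and the run uses only 200 replications at $n = 50$ (rather than $10^6$ replications over several sample sizes), so the estimates are noisier than the values quoted in the text.

```python
import math
import random

def nearest_neighbor(points):
    """Index of each point's nearest neighbor (brute force, O(n^2))."""
    nn = []
    for i, (xi, yi) in enumerate(points):
        best, best_d = -1, math.inf
        for j, (xj, yj) in enumerate(points):
            if j != i:
                d = (xi - xj) ** 2 + (yi - yj) ** 2
                if d < best_d:
                    best, best_d = j, d
        nn.append(best)
    return nn

def q_and_r(nn):
    """Q = sum of k(k-1) over points (k = times served as NN); R = points in reflexive pairs."""
    served = [0] * len(nn)
    for j in nn:
        served[j] += 1
    q = sum(k * (k - 1) for k in served)
    r = sum(1 for i, j in enumerate(nn) if nn[j] == i)
    return q, r

def estimate_ratios(n, n_mc, seed=0):
    """Average Q/n and R/n over n_mc uniform samples of size n on the unit square."""
    rng = random.Random(seed)
    q_sum = r_sum = 0.0
    for _ in range(n_mc):
        pts = [(rng.random(), rng.random()) for _ in range(n)]
        q, r = q_and_r(nearest_neighbor(pts))
        q_sum += q / n
        r_sum += r / n
    return q_sum / n_mc, r_sum / n_mc

q_ratio, r_ratio = estimate_ratios(n=50, n_mc=200)
```

With larger `n` and `n_mc`, both ratios would be expected to settle near the values reported above; a spatial index (for example, a k-d tree) would be the natural replacement for the brute-force search at that scale.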



[Figure 1 placeholder: two panels, "Estimate of Q/n under two-class CSR" (left) and "Estimate of R/n under two-class CSR" (right), each plotted against sample size (n).]

Figure 1: Plotted are the empirically estimated expectations E[Q/n] (left) and E[R/n] (right) as a function
of total sample size n.



2.4 Dixon's Overall Segregation Test 



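Dixon's overall test of segregation tests the hypothesis that expected cell counts in the NNCT are as in
Equation (1). In the two-class case, he calculates $Z_{ii} = (N_{ii} - \mathbf{E}[N_{ii}])/\sqrt{\mathbf{Var}[N_{ii}]}$ for both $i \in \{1,2\}$
and combines these test statistics into a statistic that is asymptotically distributed as $\chi^2_2$ under RL (Dixon
(1994)). The suggested test statistic is given by

\[
C = \mathbf{Y}'\,\Sigma^{-1}\,\mathbf{Y} =
\begin{bmatrix} N_{11} - \mathbf{E}[N_{11}] \\ N_{22} - \mathbf{E}[N_{22}] \end{bmatrix}'
\begin{bmatrix} \mathbf{Var}[N_{11}] & \mathbf{Cov}[N_{11}, N_{22}] \\ \mathbf{Cov}[N_{11}, N_{22}] & \mathbf{Var}[N_{22}] \end{bmatrix}^{-1}
\begin{bmatrix} N_{11} - \mathbf{E}[N_{11}] \\ N_{22} - \mathbf{E}[N_{22}] \end{bmatrix}
\tag{5}
\]

where $\mathbf{E}[N_{ii}]$ are as in Equation (1), $\mathbf{Var}[N_{ii}]$ are as in Equation (3), and

\[
\mathbf{Cov}[N_{11}, N_{22}] = (n^2 - 3n - Q + R)\,p_{1122} - n^2\,p_{11}\,p_{22}.
\tag{6}
\]

Dixon's $C$ statistic given in Equation (5) can also be written as

\[
C = \frac{Z_{11}^2 + Z_{22}^2 - 2\,r\,Z_{11}\,Z_{22}}{1 - r^2}
\]

where $r = \mathbf{Cov}[N_{11}, N_{22}]/\sqrt{\mathbf{Var}[N_{11}]\,\mathbf{Var}[N_{22}]}$ (Dixon (1994)).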



Under CSR independence, the expected values, variances, and covariances are as in the RL case. However,
the variance and covariance terms include Q and R, which are random under CSR independence and fixed
under RL. Hence Dixon's test statistic $C$ asymptotically has a $\chi^2_2$-distribution under CSR independence
conditional on Q and R. Replacing Q and R by their empirical estimates given in Section 2.3, I obtain the
QR-adjusted version of Dixon's test, which is denoted by $C_{qr}$.
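
The difference between the conditional and the QR-adjusted overall test amounts to which values of Q and R enter the variance and covariance terms. The sketch below illustrates this for the two-class case; the function name `dixon_overall`, the example NNCT, and the "observed" Q and R are hypothetical, and the p-value uses the fact that the survival function of a chi-square distribution with 2 degrees of freedom is $e^{-x/2}$.

```python
import math

def dixon_overall(table, q, r):
    """Dixon's overall statistic C for a 2x2 NNCT and its chi-square(2) p-value."""
    n1, n2 = sum(table[0]), sum(table[1])
    n = n1 + n2

    def p2(a):
        return a * (a - 1) / (n * (n - 1))

    def p3(a):
        return p2(a) * (a - 2) / (n - 2)

    def p4(a):
        return p3(a) * (a - 3) / (n - 3)

    def var(a):
        # Variance of a diagonal cell count as a function of Q and R.
        return ((n + r) * p2(a) + (2 * n - 2 * r + q) * p3(a)
                + (n * n - 3 * n - q + r) * p4(a) - (n * p2(a)) ** 2)

    p1122 = n1 * (n1 - 1) * n2 * (n2 - 1) / (n * (n - 1) * (n - 2) * (n - 3))
    cov = (n * n - 3 * n - q + r) * p1122 - n * n * p2(n1) * p2(n2)
    v1, v2 = var(n1), var(n2)
    y1 = table[0][0] - n1 * (n1 - 1) / (n - 1)   # N_11 - E[N_11]
    y2 = table[1][1] - n2 * (n2 - 1) / (n - 1)   # N_22 - E[N_22]
    det = v1 * v2 - cov ** 2
    # Quadratic form Y' Sigma^{-1} Y with the 2x2 inverse written out explicitly.
    c = (y1 * y1 * v2 - 2 * y1 * y2 * cov + y2 * y2 * v1) / det
    return c, math.exp(-c / 2)

table = [[30, 20], [18, 32]]      # hypothetical NNCT, n = 100
n = 100
c_cond, p_cond = dixon_overall(table, q=58, r=66)            # made-up observed Q, R
c_qr, p_qr = dixon_overall(table, q=0.63 * n, r=0.62 * n)    # QR-adjusted version
```

For this made-up table the two versions give similar values of the statistic, in line with the observation that the adjustment changes the results only slightly for moderate sample sizes.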



2.5 Version I of the New Segregation Tests 



Ceyhan (2007) proposed tests based on the correct sampling distribution of the cell counts in a NNCT under
CSR independence or RL. In defining the new segregation or clustering tests, I follow a track similar to that
of Dixon (Dixon (1994)), where he defines a cell-specific test statistic for each cell and then combines these
four tests into an overall test.



For cell $(i,j)$, let

\[
T^I_{ij} = N_{ij} - \frac{n_i\,c_j}{n}
\quad \text{and then let} \quad
N^I_{ij} = \frac{T^I_{ij}}{\sqrt{n_i\,c_j/n}}.
\tag{7}
\]

Furthermore, let $\mathbf{N}_I$ be the vector of $N^I_{ij}$ values concatenated row-wise and let $\Sigma_I$ be the variance-covariance
matrix of $\mathbf{N}_I$ based on the correct sampling distribution of the cell counts. That is, $\Sigma_I = \left(\mathbf{Cov}[N^I_{ij}, N^I_{kl}]\right)$
where

\[
\mathbf{Cov}[N^I_{ij}, N^I_{kl}] = \frac{n}{\sqrt{n_i\,c_j\,n_k\,c_l}}\;\mathbf{Cov}[N_{ij}, N_{kl}]
\]

with $\mathbf{Cov}[N_{ij}, N_{kl}]$ being as in Equation (3) if $(i,j) = (k,l)$ and as in Equation (6) if $(i,j) = (1,1)$ and
$(k,l) = (2,2)$. Since $\Sigma_I$ is not invertible, I use its generalized inverse, which is denoted by $\Sigma_I^-$ (Searle (2006)).
Then the first version of segregation tests suggested by Ceyhan (2007) is

\[
\mathcal{X}_I^2 = \mathbf{N}_I'\,\Sigma_I^-\,\mathbf{N}_I
\tag{8}
\]

which asymptotically has a $\chi^2_1$ distribution.



Under CSR independence, the expected values, variances, and covariances related to $\mathcal{X}_I^2$ are as in the
RL case, except they are not only conditional on column sums (i.e., on $C_j = c_j$), but also conditional on
Q and R. Hence $\mathcal{X}_I^2$ has asymptotically a $\chi^2_1$ distribution conditional on column sums, Q, and R under CSR
independence. Replacing Q and R by their empirical estimates given in Section 2.3, I obtain the QR-adjusted
version of this test, which is denoted by $\mathcal{X}^2_{I,qr}$ and is still conditional on column sums.



2.6 Version II of the New Segregation Tests 



For cell $(i,j)$, let

\[
T^{II}_{ij} = N_{ij} - \frac{n_i\,n_j}{n}
\quad \text{and then let} \quad
N^{II}_{ij} = \frac{T^{II}_{ij}}{\sqrt{n_i\,n_j/n}} = \frac{N_{ij} - n_i\,n_j/n}{\sqrt{n_i\,n_j/n}}.
\tag{9}
\]

Furthermore, let $\mathbf{N}_{II}$ be the vector of $N^{II}_{ij}$ values concatenated row-wise and let $\Sigma_{II}$ be the variance-covariance
matrix of $\mathbf{N}_{II}$ based on the correct sampling distribution of the cell counts. That is, $\Sigma_{II} = \left(\mathbf{Cov}[N^{II}_{ij}, N^{II}_{kl}]\right)$
where

\[
\mathbf{Cov}[N^{II}_{ij}, N^{II}_{kl}] = \frac{n}{\sqrt{n_i\,n_j\,n_k\,n_l}}\;\mathbf{Cov}[N_{ij}, N_{kl}].
\]

Since $\Sigma_{II}$ is not invertible, I use its generalized inverse $\Sigma_{II}^-$. Then the second version of the tests proposed by
Ceyhan (2007) is

\[
\mathcal{X}_{II}^2 = \mathbf{N}_{II}'\,\Sigma_{II}^-\,\mathbf{N}_{II}
\tag{10}
\]

which asymptotically has a $\chi^2_2$ distribution under RL. Note that $\Sigma_{II}$ can be obtained from $\Sigma$ used in Equation
(5) by multiplying $\Sigma$ entry-wise with the matrix $C^{II}_M = \left( n/\sqrt{n_i\,n_j\,n_k\,n_l} \right)$. This version of the segregation test
is asymptotically equivalent to Dixon's segregation test.



Under CSR independence, the expectations, variances, and covariances related to $\mathcal{X}_{II}^2$ are as in the RL
case, but the variances and covariances are conditional on Q and R. Hence, the asymptotic distribution
of $\mathcal{X}_{II}^2$ is also conditional on Q and R. Replacing Q and R with their empirical estimates, I obtain the
QR-adjusted version of this test, which is denoted by $\mathcal{X}^2_{II,qr}$ and is not conditional any more.

2.7 Version III of the New Segregation Tests 

Notice that version I is a conditional test (conditional on column sums), while version II is asymptotically 
equivalent to Dixon's test. Furthermore, both Dixon's test and version II incorporate only row sums (i.e., 
class sizes) in the NNCTs. 



Ceyhan (2007) suggests another test statistic which uses both the column sums (i.e., number of times a
class serves as NN) and row sums and is not conditional on the column sums. Let $T^{III}_{ij}$ denote the
corresponding statistic for cell $(i,j)$, as defined in Ceyhan (2007).

Let $\mathbf{N}_{III}$ be the vector of $T^{III}_{ij}$ values concatenated row-wise and let $\Sigma_{III}$ be the variance-covariance matrix
of $\mathbf{N}_{III}$ based on the correct sampling distribution of the cell counts. That is, $\Sigma_{III} = \left(\mathbf{Cov}[T^{III}_{ij}, T^{III}_{kl}]\right)$
where the explicit forms of $\mathbf{Cov}[T^{III}_{ij}, T^{III}_{kl}]$ are provided in Ceyhan (2007). Since $\Sigma_{III}$ is not invertible, I
use its generalized inverse $\Sigma_{III}^-$. Then the proposed test statistic by Ceyhan (2007) for overall segregation
is the quadratic form $\mathcal{X}_{III}^2 = \mathbf{N}_{III}'\,\Sigma_{III}^-\,\mathbf{N}_{III}$, which asymptotically has a $\chi^2_1$ distribution.

Under CSR independence, the discussion related to and derivation of $\mathcal{X}_{III}^2$ are as in the RL case; however,
the variance and covariance terms (hence the asymptotic distribution) are conditional on Q and R. Replacing
Q and R with their empirical estimates, I obtain the QR-adjusted version of this test, which is denoted by
$\mathcal{X}^2_{III,qr}$.

Remark 2.1. Extension to Multi-Class Case: So far, I have described the segregation tests for the two-class
case in which the corresponding NNCT is of dimension $2 \times 2$. The cell counts for the diagonal cells have
asymptotic normality. For the off-diagonal cells, although the asymptotic normality is supported by Monte
Carlo simulation results (Dixon (2002)), it is not rigorously proven yet. Nevertheless, if the asymptotic
normality held for all $q^2$ cell counts in the NNCT, under RL, Dixon's test and version II would have a $\chi^2_{q(q-1)}$
distribution, and versions I and III would have a $\chi^2_{(q-1)^2}$ distribution asymptotically. Under CSR independence,
these tests will have the corresponding asymptotic distributions conditional on Q and R. The QR-adjusted
versions can be obtained by replacing Q and R with their empirical estimates.



3 Empirical Significance Levels of NNCT-Tests under the CSR 
Independence 

For the null case, $H_o$: CSR independence, I simulate the CSR case only with classes 1 and 2 (i.e., X and
Y) of sizes $n_1$ and $n_2$, respectively. At each of $N_{mc} = 10000$ replicates, I generate data for some sample size
combinations of $n_1, n_2 \in \{10, 30, 50, 100\}$ points iid from $\mathcal{U}((0,1) \times (0,1))$. These sample size combinations
are chosen so that one can examine the influence of small and large samples, and the relative abundance of the
classes on the tests. The corresponding test statistics are recorded at each Monte Carlo replication for each
sample size combination. Then I record how many times the p-value is at or below $\alpha = .05$ for each test to
estimate the empirical size. I present the empirical sizes for NNCT-tests in Table 2, where $\hat{\alpha}_D$ is the empirical
significance level for Dixon's test, $\hat{\alpha}_I$, $\hat{\alpha}_{II}$, and $\hat{\alpha}_{III}$ are for versions I, II, and III, respectively, and $\hat{\alpha}_{D,qr}$,
$\hat{\alpha}_{I,qr}$, $\hat{\alpha}_{II,qr}$, and $\hat{\alpha}_{III,qr}$ are for the corresponding QR-adjusted versions. The empirical sizes significantly
smaller (larger) than .05 are marked with $^c$ ($^\ell$), which indicates that the corresponding test is conservative
(liberal). The asymptotic normal approximation to proportions is used in determining the significance of the
deviations of the empirical size estimates from the nominal level of .05. For these proportion tests, I also use
$\alpha = .05$ to test against empirical size being equal to .05. With $N_{mc} = 10000$, empirical sizes less (greater)
than .0464 (.0536) are deemed conservative (liberal) at the $\alpha = .05$ level.

Observe that the (unadjusted) NNCT-tests are about the desired level (or size) when $n_1$ and $n_2$ are both
$\geq 30$, and mostly conservative otherwise. The same trend holds for the QR-adjusted versions. Furthermore,
comparing the empirical sizes of QR-adjusted versions with those of unadjusted ones, I see that for almost
all cases they are not significantly different (at $\alpha = .05$ based on tests on equality of the proportions).
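
The cutoffs .0464 and .0536 quoted above can be reproduced from the normal approximation to a proportion. The sketch below assumes a one-sided critical value of 1.645 in each direction, which is an assumption on my part that happens to match the reported bounds; the function name `size_bounds` is hypothetical.

```python
import math

def size_bounds(alpha=0.05, n_mc=10000, z=1.645):
    """Bounds beyond which an empirical size differs significantly from alpha.

    Uses the normal approximation to a binomial proportion; z = 1.645 is an
    assumed one-sided 5% critical value.
    """
    se = math.sqrt(alpha * (1 - alpha) / n_mc)  # standard error of the proportion
    return alpha - z * se, alpha + z * se

lower, upper = size_bounds()
```

An empirical size below `lower` would then be flagged as conservative and one above `upper` as liberal.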



Empirical significance levels of the NNCT-tests

             conditional (i.e., unadjusted)                 unconditional (i.e., QR-adjusted)
(n_1,n_2)    α_D       α_I       α_II      α_III           α_D,qr    α_I,qr    α_II,qr   α_III,qr
(10,10)      .0432^c   .0593^ℓ   .0461^c   .0439^c         .0470     .0595^ℓ   .0486     .0365^c,<
(10,30)      .0440^c   .0451^c   .0421^c   .0410^c         .0411^c   .0465     .0381^c   .0461^c,>
(10,50)      .0482     .0335^c   .0423^c   .0397^c         .0497     .0345^c   .0411^c   .0431^c
(30,10)      .0390^c   .0411^c   .0383^c   .0391^c         .0402^c   .0423^c   .0379^c   .0436^c
(30,30)      .0464     .0544^ℓ   .0476     .0427^c         .0492     .0552^ℓ   .0478     .0409^c
(30,50)      .0454^c   .0507     .0481     .0504           .0411^c   .0517     .0464     .0515
(50,10)      .0529     .0326^c   .0468     .0379^c         .0510     .0334^c   .0428^c   .0402^c
(50,30)      .0429^c   .0494     .0468     .0469           .0405^c   .0518     .0466     .0492
(50,50)      .0508     .0494     .0497     .0499           .0528     .0494     .0524     .0488
(50,100)     .0560^ℓ   .0501     .0564^ℓ   .0516           .0556^ℓ   .0493     .0573     .0494
(100,50)     .0483     .0463^c   .0492     .0479           .0495     .0457     .0501     .0460
(100,100)    .0504     .0524     .0519     .0489           .0513     .0524     .0523     .0463^c

Table 2: The empirical significance levels for Dixon's and the new versions of the NNCT-tests by Ceyhan
(2007), as well as their QR-adjusted versions, based on 10000 Monte Carlo simulations of the CSR independence
pattern. α_D stands for the empirical significance level for Dixon's test, α_I, α_II, and α_III for versions I, II,
and III, respectively; and α_D,qr, α_I,qr, α_II,qr, and α_III,qr stand for the corresponding QR-adjusted versions.
(^c (^ℓ): the empirical size is significantly smaller (larger) than .05; i.e., the test is conservative (liberal). ^< (^>):
the empirical size of the QR-adjusted version is significantly smaller (larger) than that of the unadjusted version.)



4 Empirical Power Analysis 



To evaluate the power performance of the QR-adjusted and unadjusted NNCT-tests, I only consider alter- 
natives against the CSR pattern. That is, the points are generated in such a way that they are from an 
inhomogeneous Poisson process in a region of interest (unit square in the simulations) for at least one class. 
Furthermore, the tests considered in this article seem to have the desired nominal level for large samples 
under CSR, and QR-adjustment is not necessary under the RL pattern. Hence I avoid the alternatives 
against the RL pattern; i.e., I do not consider non-random labeling of a fixed set of points that would result 
in segregation or association. 



4.1 Empirical Power Analysis under Segregation Alternatives 



For the segregation alternatives (against the CSR pattern), three cases are considered. I generate
$X_i \stackrel{iid}{\sim} \mathcal{U}((0, 1-s) \times (0, 1-s))$ for $i = 1, 2, \ldots, n_1$ and $Y_j \stackrel{iid}{\sim} \mathcal{U}((s, 1) \times (s, 1))$ for $j = 1, 2, \ldots, n_2$. In the pattern
generated, appropriate choices of $s$ will imply $X_i$ and $Y_j$ to be more segregated than expected under CSR.
That is, it will be more likely to have $(X,X)$ NN pairs than mixed NN pairs (i.e., $(X,Y)$ or $(Y,X)$ pairs).
The three values of $s$ I consider constitute the three segregation alternatives:

\[
H_S^{I}: s = 1/6, \quad H_S^{II}: s = 1/4, \quad \text{and} \quad H_S^{III}: s = 1/3.
\tag{12}
\]

Observe that, from $H_S^{I}$ to $H_S^{III}$ (i.e., as $s$ increases), the segregation gets stronger in the sense that X and Y
points tend to form one-class clumps or clusters. By construction, the points are uniformly generated, hence
exhibit homogeneity with respect to their supports for each class, but with respect to the unit square these
alternative patterns are examples of departures from first-order homogeneity, which implies segregation of
the classes X and Y. The simulated segregation patterns are symmetric in the sense that the X and Y classes
are equally segregated (or clustered) from each other.
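
The segregation alternatives described above can be simulated with a few lines of Python. This is a sketch of the generating mechanism only; the function name `segregation_sample` and the seed are illustrative choices.

```python
import random

def segregation_sample(n1, n2, s, rng):
    """X points uniform on (0, 1-s)^2 and Y points uniform on (s, 1)^2."""
    xs = [(rng.uniform(0, 1 - s), rng.uniform(0, 1 - s)) for _ in range(n1)]
    ys = [(rng.uniform(s, 1), rng.uniform(s, 1)) for _ in range(n2)]
    return xs, ys

rng = random.Random(1)
# The three alternatives correspond to s = 1/6, 1/4, and 1/3.
samples = {s: segregation_sample(100, 100, s, rng) for s in (1 / 6, 1 / 4, 1 / 3)}
```

As $s$ grows, the overlap of the two supports, the square $(s, 1-s)^2$, shrinks, which is why the classes become more segregated.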
Figure 2: Three realizations for $H_S^{I}: s = 1/6$, $H_S^{II}: s = 1/4$, and $H_S^{III}: s = 1/3$ with $n_1 = 100$ X points
(solid squares ■) and $n_2 = 100$ Y points (triangles △).



The power estimates against the sample size combinations are presented in Figure 3, where $\hat{\beta}_D$ is for
Dixon's test, $\hat{\beta}_I$, $\hat{\beta}_{II}$, and $\hat{\beta}_{III}$ are for versions I, II, and III, respectively, and the QR-adjusted versions
are indicated by $qr$ in their subscripts. Observe that, as $n = (n_1 + n_2)$ gets larger, the power estimates get
larger. For the same $n = (n_1 + n_2)$ values, the power estimate is larger for classes with similar sample sizes.
Furthermore, as the segregation gets stronger, the power estimates get larger. The NNCT-tests have about
the same power performance under these segregation alternatives. Notice also that for small samples the
power estimates of the QR-adjusted versions are slightly larger, but for other sample size combinations the
power estimates for the QR-adjusted versions and the unadjusted versions are virtually indistinguishable.



4.2 Empirical Power Analysis under Association Alternatives 

For the association alternatives (against the CSR pattern), I also consider three cases. First, I generate
$X_i \stackrel{iid}{\sim} \mathcal{U}((0,1) \times (0,1))$ for $i = 1, 2, \ldots, n_1$. Then I generate $Y_j$ for $j = 1, 2, \ldots, n_2$ as follows. For each $j$,
I pick an $i$ randomly, then generate $Y_j$ as $X_i + R_j\,(\cos T_j, \sin T_j)'$ where $R_j \sim \mathcal{U}(0, r)$ with $r \in (0,1)$ and
$T_j \sim \mathcal{U}(0, 2\pi)$. In the pattern generated, appropriate choices of $r$ will imply $Y_j$ and $X_i$ are more associated
than expected. That is, it will be more likely to have $(X,Y)$ NN pairs than self NN pairs (i.e., $(X,X)$ or
$(Y,Y)$). The three values of $r$ I consider constitute the three association alternatives:

\[
H_A^{I}: r = 1/4, \quad H_A^{II}: r = 1/7, \quad \text{and} \quad H_A^{III}: r = 1/10.
\tag{13}
\]
Figure 3: Empirical power estimates for the QR-adjusted and unadjusted NNCT-tests based on 10000 Monte
Carlo replications under the segregation alternatives. The numbers in the horizontal axis labels represent
sample (i.e., class) size combinations: 1=(10,10), 2=(10,30), 3=(10,50), 4=(30,10), 5=(30,30), 6=(30,50),
7=(50,10), 8=(50,30), 9=(50,50), 10=(50,100), 11=(100,50), 12=(100,100).



Observe that, from $H_A^{I}$ to $H_A^{III}$ (i.e., as $r$ decreases), the association gets stronger in the sense that X and Y
points tend to occur together more and more frequently. By construction, X points are from a homogeneous
Poisson process with respect to the unit square, while Y points exhibit inhomogeneity in the same region.
Furthermore, these alternative patterns are examples of departures from second-order homogeneity, which
implies association of the class Y with class X.
Figure 4: Realizations of the association alternatives (e.g., $r = 1/4$), with $n_2 = 100$ Y points (triangles △). 
The power estimates under the association alternatives are presented in Figure 5, where labeling is as 
in Figure 2. Observe that, for similar sample sizes, as $n = n_1 + n_2$ gets larger, the power estimates get 
larger at each association alternative. Furthermore, as the association gets stronger, the power estimates get 
larger at each sample size combination. The NNCT-tests have about the same power estimates under these 
association alternatives. Furthermore, the QR-adjusted versions of the tests have virtually the same power 
estimates as the unadjusted versions; for the smaller samples, the QR-adjusted versions have slightly lower power 
estimates. 

Remark 4.1. Main Result of Monte Carlo Simulation Analysis: Based on the simulation results under 
CSR independence of the points, I observe that none of the NNCT-tests I consider has the desired level when 
at least one sample size is small, so that the cell count(s) in the corresponding NNCT have a high probability 
of being < 5. This usually corresponds to the case that at least one sample size is < 10 or the sample 
sizes are very different in the simulation study. When sample sizes are small (hence the corresponding cell 
counts are < 5), the asymptotic approximation of the NNCT-tests is not appropriate. So Dixon (1994) 
recommends Monte Carlo randomization for his test when some cell count(s) are < 5 in a NNCT. I extend 
this recommendation to all the NNCT-tests discussed in this article. Furthermore, among the NNCT-tests, 
Dixon's and version III tests seem to be affected by the QR-adjustment more than the other tests in terms of 
empirical size. But QR-adjustment does not necessarily improve the results of the NNCT-analysis under CSR 
independence, as the empirical sizes of the adjusted and unadjusted versions are not significantly different. 
Furthermore, the QR-adjustment does not significantly improve the power performance under segregation 
and association alternatives. In fact, the power estimates of the QR-adjusted and unadjusted tests were about 
the same under these alternatives. 
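To make the Monte Carlo randomization recommendation concrete, the following Python sketch permutes class labels over fixed locations and recomputes the NNCT each time. Several things here are assumptions rather than the procedure of this article: the statistic is a simple Pearson-type stand-in (it is not Dixon's statistic, which needs the covariances of the cell counts), the expected cell counts are the usual random-labeling values $n_i(n_i-1)/(n-1)$ and $n_i n_j/(n-1)$, and all function names and the toy data are hypothetical. Permuting labels corresponds to the RL null; under CSR independence one would also regenerate the locations.

```python
import random


def nearest_neighbors(points):
    """Index of each point's nearest neighbor (brute force; ties go to the lowest index)."""
    nn = []
    for i, (xi, yi) in enumerate(points):
        best, best_d2 = -1, float("inf")
        for j, (xj, yj) in enumerate(points):
            if i != j:
                d2 = (xi - xj) ** 2 + (yi - yj) ** 2
                if d2 < best_d2:
                    best, best_d2 = j, d2
        nn.append(best)
    return nn


def nnct(nn, labels):
    """2 x 2 NNCT: rows are base classes, columns are NN classes (labels are 0 or 1)."""
    table = [[0, 0], [0, 0]]
    for i, j in enumerate(nn):
        table[labels[i]][labels[j]] += 1
    return table


def stand_in_statistic(table, sizes):
    """Pearson-type stand-in; requires every class size to be at least 2."""
    n = sum(sizes)
    stat = 0.0
    for a in range(2):
        for b in range(2):
            if a == b:
                expected = sizes[a] * (sizes[a] - 1) / (n - 1)
            else:
                expected = sizes[a] * sizes[b] / (n - 1)
            stat += (table[a][b] - expected) ** 2 / expected
    return stat


def randomization_p_value(points, labels, n_perm, rng):
    nn = nearest_neighbors(points)  # locations are fixed, so NN pairs are computed once
    sizes = [labels.count(0), labels.count(1)]
    observed = stand_in_statistic(nnct(nn, labels), sizes)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # random relabeling of the fixed locations
        if stand_in_statistic(nnct(nn, shuffled), sizes) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)


# Toy data: two small, well-separated clusters, i.e., a strongly segregated pattern
rng = random.Random(5)
points = [(0.2 * rng.random(), 0.2 * rng.random()) for _ in range(8)]
points += [(0.8 + 0.2 * rng.random(), 0.8 + 0.2 * rng.random()) for _ in range(8)]
labels = [0] * 8 + [1] * 8
p_value = randomization_p_value(points, labels, n_perm=999, rng=rng)
```

Because the reference distribution is built from the relabelings themselves, the sketch does not rely on the asymptotic approximation that breaks down for small cell counts.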
Figure 5: Empirical power estimates for the QR-adjusted and unadjusted NNCT-tests under the association 
alternatives. The numbers in the horizontal axis labels represent sample (i.e., class) size combinations: 
1=(10,10), 2=(10,30), 3=(10,50), 4=(30,10), 5=(30,30), 6=(30,50), 7=(50,10), 8=(50,30), 9=(50,50), 
10=(50,100), 11=(100,50), 12=(100,100). 



5 Examples 



I illustrate the tests on two examples: an ecological data set, namely the swamp tree data (Good and Whipple 
(1982)), and an artificial data set. 



5.1 Swamp Tree Data 



Good and Whipple (1982) considered the spatial patterns of tree species along the Savannah River, South 
Carolina, U.S.A. From this data, Dixon (2002) used a single 50 m × 200 m rectangular plot to illustrate his 
tests. All live or dead trees with 4.5 cm or more dbh (diameter at breast height) were recorded together 
with their species. Hence it is an example of a realization of a marked multivariate point pattern. The plot 
contains 13 different tree species, four of which comprise over 90 % of the 734 tree stems. The remaining tree 
stems were categorized as "other trees". The plot consists of 215 water tupelo (Nyssa aquatica), 205 black 
gum (Nyssa sylvatica), 156 Carolina ash (Fraxinus caroliniana), 98 bald cypress (Taxodium distichum), and 
60 stems of 8 additional species (i.e., other species). I will only consider live trees from the two most frequent 
tree species in this data set (i.e., water tupelos and black gums). So a 2 × 2 NNCT-analysis is conducted for 
this data set. If segregation among the less frequent species were important, a more detailed 5 × 5 or a 12 × 12 
NNCT-analysis should be performed. The locations of these trees in the study region are plotted in Figure 6, 
and the corresponding 2 × 2 NNCT together with percentages based on row and grand sums are provided in 
Table 3. For example, for water tupelo as the base species and black gum as the NN species, the cell count 
is 54, which is 26 % of the 211 water tupelos (which are 54 % of all 394 trees). Observe that the percentages 
and Figure 6 are suggestive of segregation for both tree species, since the observed percentages of species 
with themselves as the NN are much larger than the corresponding marginal (row sum) percentages. 



Figure 6: The scatter plot of the locations of water tupelos (circles ○) and black gums (triangles △) in the swamp tree data. 







                          NN species
                     W.T.          B.G.          sum
base       W.T.      157 (74 %)     54 (26 %)    211 (54 %)
species    B.G.       52 (28 %)    131 (72 %)    183 (46 %)
           sum       209 (53 %)    185 (47 %)    394 (100 %)

Table 3: The NNCT for swamp tree data and the corresponding percentages (in parentheses), where the 
cell percentages are with respect to the row sums and marginal percentages are with respect to the total 
size. W.T. = water tupelos and B.G. = black gums. 
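As a quick arithmetic check, the percentages in Table 3 can be recomputed from the four cell counts. The short Python sketch below does only that; the variable names are illustrative and rounding to whole percents is assumed to match the table.

```python
# Cell counts from Table 3: rows are base species (W.T., B.G.), columns are NN species
nnct = [[157, 54],
        [52, 131]]

row_sums = [sum(row) for row in nnct]
col_sums = [sum(col) for col in zip(*nnct)]
total = sum(row_sums)

# Cell percentages are taken with respect to the row sums
cell_pct = [[round(100 * count / rs) for count in row] for row, rs in zip(nnct, row_sums)]
# Marginal percentages are taken with respect to the total size
row_pct = [round(100 * rs / total) for rs in row_sums]
col_pct = [round(100 * cs / total) for cs in col_sums]

print(cell_pct)  # [[74, 26], [28, 72]]
print(row_pct)   # [54, 46]
print(col_pct)   # [53, 47]
```

The diagonal cell percentages (74 % and 72 %) exceed the marginal percentages of the same species (54 % and 46 %), which is the comparison that suggests segregation.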



The locations of the tree species can be viewed a priori as resulting from different processes, so the more 
appropriate null hypothesis is the CSR independence pattern. Hence our inference will be a conditional one 
(see Section 2.3) if I use the observed values of Q and R. I observe Q = 270 and R = 236 for this data 
set, and the empirical estimates of their expected values for these sample sizes are 249.68 for Q and 244.95 for R. I present the test 
statistics and the associated p-values for the NNCT-tests in Table 4. Observe that the test statistics all decrease 
with the QR-adjustment; however, this decrease is not substantial enough to alter the conclusions. Based on the 
NNCT-tests, I find that the segregation between the two species is significant, since all the tests considered 
yield significant p-values, and the diagonal cells (i.e., cells (1, 1) and (2, 2)) are larger than expected. 
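For readers who want to see how a NNCT and the observed Q and R are obtained from coordinates, here is a minimal Python sketch. It assumes the definitions commonly attributed to Dixon (1994): R counts the points belonging to reflexive NN pairs (twice the number of such pairs), and Q is the sum of d(d - 1) over points, where d is the number of points that have the given point as their NN. These definitions, the function names, and the tiny example are assumptions made for illustration; they are not taken from this section.

```python
def nearest_neighbors(points):
    """Index of each point's nearest neighbor (brute force; ties go to the lowest index)."""
    nn = []
    for i, (xi, yi) in enumerate(points):
        best, best_d2 = -1, float("inf")
        for j, (xj, yj) in enumerate(points):
            if i != j:
                d2 = (xi - xj) ** 2 + (yi - yj) ** 2
                if d2 < best_d2:
                    best, best_d2 = j, d2
        nn.append(best)
    return nn


def nnct_q_r(points, labels, classes):
    """Return (NNCT, Q, R) for labeled points; rows are base classes, columns NN classes."""
    nn = nearest_neighbors(points)
    index = {c: k for k, c in enumerate(classes)}
    table = [[0] * len(classes) for _ in classes]
    for i, j in enumerate(nn):
        table[index[labels[i]]][index[labels[j]]] += 1
    # R: points whose NN has them as its NN (each reflexive pair contributes 2)
    r = sum(1 for i, j in enumerate(nn) if nn[j] == i)
    # Q: sum of d * (d - 1), where d counts how many points share this point as NN
    indegree = [0] * len(points)
    for j in nn:
        indegree[j] += 1
    q = sum(d * (d - 1) for d in indegree)
    return table, q, r


# Tiny collinear example: the middle point is the NN of both outer points
points = [(0.0, 0.0), (1.0, 0.0), (2.5, 0.0)]
labels = ["X", "Y", "X"]
table, q, r = nnct_q_r(points, labels, classes=["X", "Y"])
print(table, q, r)  # [[0, 2], [1, 0]] 2 2
```

Note that Q and R depend only on the locations, not on the class labels, which is why they are fixed under RL but random under CSR independence.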



NNCT-test statistics and the associated p-values for swamp tree data

$C$          $X_I^2$         $X_{II}^2$       $X_{III}^2$
52.72        52.08           52.14            52.66
(< .0001)    (< .0001)       (< .0001)        (< .0001)

$C_{qr}$     $X_{I,qr}^2$    $X_{II,qr}^2$    $X_{III,qr}^2$
51.98        51.35           51.41            51.92
(< .0001)    (< .0001)       (< .0001)        (< .0001)

Table 4: Test statistics and the associated p-values (in parentheses) for NNCT-tests for the swamp tree 
data set. $C$ stands for Dixon's overall test; $X_I^2$, $X_{II}^2$, and $X_{III}^2$ stand for versions I, II, and III of the tests by 
Ceyhan (2007). $C_{qr}$, $X_{I,qr}^2$, $X_{II,qr}^2$, and $X_{III,qr}^2$ are the QR-adjusted versions of these tests. 



5.2 Artificial Data Set 

In the swamp tree example, although the test statistics for unadjusted and QR-adjusted versions are different 
for Pielou's and Dixon's tests and p-values for QR-adjusted versions are larger than unadjusted ones, I have 
the same conclusion: there is strong evidence for segregation of tree species. Below, I present an artificial 
example, a random sample of size 100 (with 50 X-points and 50 Y-points uniformly generated on the unit 
square). The question of interest is the spatial interaction between X and Y classes. I plot the locations 
of the points in Figure 7, and the corresponding NNCT together with percentages are provided in Table 5. 
Observe that the percentages are suggestive of mild segregation, with equal degree for both classes. 



Figure 7: The scatter plot of the locations of X (circles ○) and Y points (triangles △) in the artificial data 
set. 



The data is generated to resemble the CSR independence pattern, so I assume the null pattern is CSR 
independence, which implies that our inference will be a conditional one if I use the observed values of Q 
and R. I observe Q = 70 and R = 60 for this data set, and the empirical estimates of their expected values for these sample sizes 
are 63.37 for Q and 62.17 for R. I present the test statistics and the associated p-values for the NNCT-tests in 
Table 6. Observe that the test statistics all decrease with the QR-adjustment; however, this decrease is not 
substantial enough to alter the conclusions. Based on the NNCT-tests, I find that the spatial interaction between 
X and Y is not significantly different from CSR independence. 

                        NN class
                     X            Y            sum
base       X         30 (60 %)    20 (40 %)     50 (50 %)
class      Y         19 (38 %)    31 (62 %)     50 (50 %)
           sum       49 (49 %)    51 (51 %)    100 (100 %)

Table 5: The NNCT for the artificial data and the corresponding percentages (in parentheses), where the 
cell percentages are with respect to the row sums and marginal percentages are with respect to the total 
size. 

In both examples, although QR-adjustment did not change the conclusions, it might make a difference if 
the pattern is a close call between CSR independence and segregation/association. That is, if a segregation 
test has a p-value of about .05, then after the QR-adjustment it might become significant or insignificant, depending 
on the case. 



NNCT-test statistics and the associated p-values for the artificial data

$C$          $X_I^2$         $X_{II}^2$       $X_{III}^2$
3.36         3.02            3.07             3.30
(.1868)      (.0825)         (.2152)          (.0693)

$C_{qr}$     $X_{I,qr}^2$    $X_{II,qr}^2$    $X_{III,qr}^2$
3.32         2.97            3.04             3.25
(.1906)      (.0846)         (.2192)          (.0713)

Table 6: Test statistics and the associated p-values (in parentheses) for NNCT-tests for the artificial data 
set. The notation for the tests is as in Table 4. 



6 Discussion and Conclusions 



In this article, I discuss the effect of QR-adjustment on segregation or clustering tests based on nearest 
neighbor contingency tables (NNCTs). These tests include Dixon's overall test (Dixon (1994)) and the 
three new overall segregation tests introduced by Ceyhan (2007). QR-adjustment is performed on these 
tests based on NNCTs (i.e., NNCT-tests) when the null case is the CSR of two classes of points (i.e., CSR 
independence), since under CSR independence, the NNCT-tests depend on the number of reflexive NNs (denoted 
by R) and the number of shared NNs (denoted by Q), both of which depend on the allocation of the points. 
When the observed values of Q and R are used, the NNCT-tests are conditional tests, which might bias 
the results of the analysis. Given the difficulty in calculating the expected values of Q and R under CSR 
independence, I estimate them empirically based on extensive Monte Carlo simulations, and substitute these 
estimates for the expected values of Q and R (which is called the QR-adjustment in this article). 
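The empirical estimation step can be illustrated with a small Python sketch that averages Q and R over simulated CSR patterns. The setup is assumed for illustration only: points are uniform on the unit square, the replication count (200) is far smaller than an extensive study would use, the definitions of Q and R are those commonly attributed to Dixon (1994), and the function names are hypothetical.

```python
import random


def nearest_neighbors(points):
    """Index of each point's nearest neighbor (brute force; ties go to the lowest index)."""
    nn = []
    for i, (xi, yi) in enumerate(points):
        best, best_d2 = -1, float("inf")
        for j, (xj, yj) in enumerate(points):
            if i != j:
                d2 = (xi - xj) ** 2 + (yi - yj) ** 2
                if d2 < best_d2:
                    best, best_d2 = j, d2
        nn.append(best)
    return nn


def q_and_r(points):
    """Q (shared NNs) and R (reflexive NNs) for one point pattern."""
    nn = nearest_neighbors(points)
    r = sum(1 for i, j in enumerate(nn) if nn[j] == i)
    indegree = [0] * len(points)
    for j in nn:
        indegree[j] += 1
    q = sum(d * (d - 1) for d in indegree)
    return q, r


def estimate_expected_q_r(n, n_mc, rng):
    """Average Q and R over n_mc simulated CSR patterns of n points on the unit square."""
    q_total = r_total = 0
    for _ in range(n_mc):
        points = [(rng.random(), rng.random()) for _ in range(n)]
        q, r = q_and_r(points)
        q_total += q
        r_total += r
    return q_total / n_mc, r_total / n_mc


# n = 100 matches the total size of the artificial data set (50 X and 50 Y points);
# class labels are not needed because Q and R depend only on the locations
q_hat, r_hat = estimate_expected_q_r(n=100, n_mc=200, rng=random.Random(7))
print(f"estimated E[Q] = {q_hat:.2f}, estimated E[R] = {r_hat:.2f}")
```

With more replications the averages stabilize, and they can be compared against the estimates reported for the artificial data set (63.37 for Q and 62.17 for R).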

I compare the empirical sizes and power estimates of the NNCT-tests with extensive Monte Carlo 
simulations. Based on the Monte Carlo analysis, I find that QR-adjustment does not affect the empirical sizes of 
the tests. Moreover, QR-adjustment does not have a substantial influence on these NNCT-tests under the 
segregation or association alternatives. Thus, one can use the QR-adjusted or the unadjusted versions of the 
NNCT-tests. 